Introspective Fault Tolerance for Exascale Systems∗
نویسندگان
چکیده
Faults and errors are an unavoidable aspect of high performance computing systems. Emerging exascale systems will contain billions of hardware components and complex software stacks. In addition, higher fabrication density and power challenges will further compound fault detection, management and recovery. Efficient fault tolerance and resiliency frameworks are thus of immense importance in the path to the exascale era [7].
منابع مشابه
Algorithms and Scheduling Techniques for Exascale Systems
Exascale systems to be deployed in the near future will come with deep hierarchical parallelism, will exhibit various levels of heterogeneity, will be prone to frequent component failures, and will face tight power consumption constraints. The notion of application performance in these systems becomes multi-criteria, with fault-tolerance and power consumption metrics to be considered in additio...
متن کاملToward Exascale Resilience
Over the past few years resilience has became a major issue for HPC systems, in particular in the perspective of large Petascale systems and future Exascale ones. These systems will typically gather from half a million to several millions of CPU cores running up to a billion of threads. From the current knowledge and observations of existing large systems, it is anticipated that Exascale system...
متن کاملCooperative Application/OS DRAM Fault Recovery
Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience chal...
متن کاملFault Tolerance Techniques for Scalable Computing∗
The largest systems in the world today already scale to hundreds of thousands of cores. With plans under way for exascale systems to emerge within the next decade, we will soon have systems comprising more than a million processing elements. As researchers work toward architecting these enormous systems, it is becoming increasingly clear that, at such scales, resilience to hardware faults is go...
متن کاملA Global Exception Fault Tolerance Model for MPI
Driven both by the anticipated hardware reliability constraints for exascale systems, and the desire to use MPI in a broader application space, there is an ongoing effort to incorporate fault tolerance constructs into MPI. Several faulttolerant models have been proposed for MPI [1], [2], [3], [4]. However, despite these attempts, and over six years of effort by the MPI Forum’s [5] Fault Toleran...
متن کامل